
    BooookScore: A systematic exploration of book-length summarization in the era of LLMs

    Summarizing book-length documents (>100K tokens) that exceed the context window size of large language models (LLMs) requires first breaking the input document into smaller chunks and then prompting an LLM to merge, update, and compress chunk-level summaries. Despite the complexity and importance of this task, it has yet to be meaningfully studied due to the challenges of evaluation: existing book-length summarization datasets (e.g., BookSum) are in the pretraining data of most public LLMs, and existing evaluation methods struggle to capture errors made by modern LLM summarizers. In this paper, we present the first study of the coherence of LLM-based book-length summarizers implemented via two prompting workflows: (1) hierarchically merging chunk-level summaries, and (2) incrementally updating a running summary. We obtain 1193 fine-grained human annotations on GPT-4 generated summaries of 100 recently-published books and identify eight common types of coherence errors made by LLMs. Because human evaluation is expensive and time-consuming, we develop an automatic metric, BooookScore, that measures the proportion of sentences in a summary that do not contain any of the identified error types. BooookScore has high agreement with human annotations and allows us to systematically evaluate the impact of many other critical parameters (e.g., chunk size, base LLM) while saving $15K and 500 hours in human evaluation costs. We find that closed-source LLMs such as GPT-4 and Claude 2 produce summaries with higher BooookScore than the oft-repetitive ones generated by LLaMA 2. Incremental updating yields lower BooookScore but a higher level of detail than hierarchical merging, a trade-off sometimes preferred by human annotators. We release code and annotations after blind review to spur more principled research on book-length summarization.
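The two prompting workflows and the sentence-level metric described in this abstract can be illustrated with a minimal sketch. It is not the authors' released code: the `summarize` callable stands in for an LLM call, the `fan_in` parameter and all function names are hypothetical, and the per-sentence error labels are assumed to come from some annotator (human or model) that the sketch does not implement.

```python
from typing import Callable, List, Sequence, Set


def hierarchical_merge(
    chunks: Sequence[str], summarize: Callable[[str], str], fan_in: int = 2
) -> str:
    """Summarize each chunk, then merge groups of summaries until one remains.

    `summarize` is a placeholder for an LLM call; `fan_in` (an assumption)
    is how many summaries are merged per step.
    """
    if not chunks:
        raise ValueError("need at least one chunk")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level: List[str] = [summarize(chunk) for chunk in chunks]
    while len(level) > 1:
        level = [
            summarize("\n".join(level[i : i + fan_in]))
            for i in range(0, len(level), fan_in)
        ]
    return level[0]


def incremental_update(chunks: Sequence[str], summarize: Callable[[str], str]) -> str:
    """Keep one running summary and revise it as each new chunk arrives."""
    running = ""
    for chunk in chunks:
        running = summarize((running + "\n" + chunk).strip())
    return running


def proportion_error_free(sentence_errors: Sequence[Set[str]]) -> float:
    """Fraction of summary sentences that carry no error label.

    Each element is the set of error labels assigned to one sentence;
    an empty set means the sentence was judged free of errors.
    """
    if not sentence_errors:
        raise ValueError("summary has no sentences")
    clean = sum(1 for errors in sentence_errors if not errors)
    return clean / len(sentence_errors)
```

With a real LLM behind `summarize`, both workflows return a single summary string; scoring it then only requires a list with one (possibly empty) label set per sentence.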

    Decomposing Complex Queries for Tip-of-the-tongue Retrieval

    When re-finding items, users who forget or are uncertain about identifying details often rely on creative strategies for expressing their information needs -- complex queries that describe content elements (e.g., book characters or events), information beyond the document text (e.g., descriptions of book covers), or personal context (e.g., when they read a book). This retrieval setting, called tip of the tongue (TOT), is especially challenging for models heavily reliant on lexical and semantic overlap between query and document text. In this work, we introduce a simple yet effective framework for handling such complex queries by decomposing the query into individual clues, routing those as sub-queries to specialized retrievers, and ensembling the results. This approach allows us to take advantage of off-the-shelf retrievers (e.g., CLIP for retrieving images of book covers) or incorporate retriever-specific logic (e.g., date constraints). We show that our framework incorporating query decompositions into retrievers can improve gold book recall, with up to a 7% relative gain in Recall@5, on a new collection of 14,441 real-world query-book pairs from an online community for resolving TOT inquiries.
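The decompose, route, and ensemble pipeline described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the clue types, the `Retriever` signature, and the choice to ensemble by summing scores are all hypothetical, and the decomposition step itself is assumed to have already produced typed clues.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# A retriever maps a sub-query to {document_id: relevance_score} (assumed shape).
Retriever = Callable[[str], Dict[str, float]]


def ensemble_retrieve(
    clues: List[Tuple[str, str]],
    retrievers: Dict[str, Retriever],
    k: int = 5,
) -> List[str]:
    """Route each (clue_type, text) pair to its retriever and combine scores.

    Scores are combined by simple summation, which is an assumption made
    for this sketch; clues with no matching retriever are skipped.
    """
    totals: Dict[str, float] = defaultdict(float)
    for clue_type, text in clues:
        retriever = retrievers.get(clue_type)
        if retriever is None:
            continue
        for doc_id, score in retriever(text).items():
            totals[doc_id] += score
    # Sort by descending score, breaking ties by document id for determinism.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked[:k]]
```

Keeping each retriever behind the same small interface is what would let an image retriever for cover descriptions or a date filter be swapped in without changing the ensembling step.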

    LIMEADE: A General Framework for Explanation-Based Human Tuning of Opaque Machine Learners

    Research in human-centered AI has shown the benefits of systems that can explain their predictions. Methods that allow humans to tune a model in response to the explanations are similarly useful. While both capabilities are well-developed for transparent learning models (e.g., linear models and GA2Ms), and recent techniques (e.g., LIME and SHAP) can generate explanations for opaque models, no method for tuning opaque models in response to explanations has been user-tested to date. This paper introduces LIMEADE, a general framework for tuning an arbitrary machine learning model based on an explanation of the model's prediction. We demonstrate the generality of our approach with two case studies. First, we successfully utilize LIMEADE for the human tuning of opaque image classifiers. Second, we apply our framework to a neural recommender system for scientific papers on a public website and report on a user study showing that our framework leads to significantly higher perceived user control, trust, and satisfaction. Analyzing 300 user logs from our publicly-deployed website, we uncover a tradeoff between canonical greedy explanations and diverse explanations that better facilitate human tuning. Comment: 16 pages, 7 figures